What is Open Street Map
Open Street Map (OSM) is a “collaborative project to create a free editable geographic database of the world” (Wikipedia). OSM is free and has been adapted widely in academia as a reliable source of data. OSM provides various types of data, ranging from raster map tiles to geographic features in vector forms. We will focus on the vector data.
Overview
This document will walk to you through:
- How to download OSM vector data.
- How to convert the cleaned geometry vectors into a network.
- How to calculate centrality measures.
- How to do route finding (shortest path).
Downloading the data
We will first define the bounding box within which we will get OSM
data using nominatimlite package. It uses the Nominatim API
which is a geocoding service that powers the official OSM site. We used
to get Census city boundary using tigris package, but since
we are learning OSM, we can get the city boundary from OSM as well. Of
course, you can create your own bounding box - we do it a few code
chunks below.
# Get bounding box coordinates for Atlanta
bb <- nominatimlite::geo_lite_sf('Atlanta, GA', points_only = F) %>%
st_bbox()
bb_sf <- bb %>% st_as_sfc()
## Plot--
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(bb_sf) + tm_borders()
OSM wiki states that OMS represents physical features on the ground using tags attached to its basic data structures (i.e., nodes, ways, and relations). Although there can be infinite number of attributes (i.e., tags) describing each feature, there are some key-value pairs that are agreed upon by the community and are widely used. Although it might not be accurate description, but thinking the key as a category and value as entries within the category would be reasonably accurate description. Some of the widely used keys include Amenity, Buildings, Highway, and Shop. Then there are values attached to each of the key. For Amenity, there are education, transportation, financial, healthcare, and entertainment, etc.
For Highway (which means road in OSM), the followings are some examples of the values:
- motorway
- trunk
- primary
- secondary
- tertiary
- residential
- various links, including motorway_link, trunk_link, primary_link, etc.
For the full list of the key-pair values, please refer to OSMwiki. To download all
possible key:value pairs for highway, you can insert
available_tags("highway") instead of manually specifying a
list of values. One caveat is that, using all available tags will
generate a large data, significantly slowing down the processing speed.
Here, we download seven values under key = ‘highway’ and convert it into
sf format. The last function in the code below,
osm_poly2line() is added because “Street networks
downloaded with add_osm_object(key =”highway”) will store any circular
highways in osm_polygons. this function combines those with the
osm_lines component to yield a single sf data.frame of all highways,
whether polygonal or not” (source).
# Get OSM road data
osm_road <- opq(bbox = bb) %>%
add_osm_feature(key = 'highway',
value = c("motorway", "motorway_link",
"trunk", "trunk_link",
"primary", "primary_link",
"secondary", "secondary_link",
"tertiary", "residential")) %>%
osmdata_sf() %>%
osm_poly2line()
names(osm_road)
## [1] "bbox" "overpass_call" "meta"
## [4] "osm_points" "osm_lines" "osm_polygons"
## [7] "osm_multilines" "osm_multipolygons"
This object osm_road is a list that has multiple
elements, including bbox, osm_points, osm_lines, osm_polygons, etc. We
can extract osm_lines and see what it looks like. The
prop.table() function below shows that ‘residential’ tag is
roughly 70%.
## Plot--
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(osm_road$osm_lines) + tm_lines(col = "highway")
# Breakdown of highway types
round(prop.table(table(osm_road$osm_lines$highway)) * 100, 1)
##
## motorway motorway_link primary primary_link residential
## 3.3 4.6 5.5 0.6 63.4
## secondary secondary_link tertiary trunk trunk_link
## 10.8 0.8 7.9 2.9 0.2
Defining a custom bounding box
Displaying the entire network we just downloaded can slow down R environment. For class exercise (for students to be able to follow along), we will limit the bounding box to a smaller area. We will create a custom bbox using coordinates of our selection. This technique of generating your own bounding box can also be useful if you have a specific area of interest that doesn’t overlap well with commonly used boundaries.
You need to define two points for bounding box, one point at the
lower left corner and the other at the upper right corner. You can go to
Google Maps, right-click on a point on map, and copy the XY coordinate.
Let’s store them in p1 and p2. Notice that
Google Maps will give you YX coords, but we need p1 and
p2 in XY format.
# p1 is lower left corner, p2 is the upper right corner
p1 <- c(-84.42138164090896, 33.729724924925925)
p2 <- c( -84.34372362786227, 33.79761117884929)
You will then need to format the two coordinates into the following format. Then, let’s convert it into sf object.
# Custom BB
my_bb <- c(st_point(p1) %>% st_sfc(crs = 4326),
st_point(p2) %>% st_sfc(crs = 4326)) %>% st_bbox()
# Custom BB to sf
my_bb_sf <- my_bb %>% st_as_sfc()
# Extract a smaller network for exercise purpose
osm_small <- osm_road$osm_lines[my_bb_sf,] %>% select(osm_id, highway)
## Plot--
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(bb_sf) + tm_borders(col = "black") + # Black = larger bbox
tm_shape(my_bb_sf) + tm_borders(col = "red") + # Red = smaller bbox
tm_shape(osm_small) + tm_lines(col = "black") # Black line = small network
Converting OSM data into network using sfnetworks
package
Now we have OSM network geometry. This network geometry, however, has
many issues that need to be resolved before it can be used for analysis.
We will use sfnetworks package to do both cleaning and
analysis. We can pass osm_small to
as_sfnetwork() to convert it to sfnetwork object. It is
important to note that the sf object you pass to
as_sfnetwork() function should be LINESTRING.
When using as_sfnetwork(), it creates a directed
network by default. For simplicity, we will set this to FALSE.
You can see that the output net object has two parts:
nodes and edges. Also notice that the
node data is denoted as ‘active’. The sfnetworks objects is made of
nodes and edges that connect the nodes. When you do data manipulation
using, for example, dplyr verbs such as mutate, you can choose whether
to apply it to nodes or edges by activating the one
that you want to modify.
# Converting the line component of OSM data into a network
net <- sfnetworks::as_sfnetwork(osm_small, directed = FALSE)
print(net)
## # A sfnetwork with 4543 nodes and 4228 edges
## #
## # CRS: EPSG:4326
## #
## # An undirected multigraph with 816 components with spatially explicit edges
## #
## # A tibble: 4,543 × 1
## geometry
## <POINT [°]>
## 1 (-84.34923 33.74581)
## 2 (-84.35038 33.7454)
## 3 (-84.34893 33.76049)
## 4 (-84.34917 33.76218)
## 5 (-84.34917 33.76184)
## 6 (-84.34892 33.76025)
## # ℹ 4,537 more rows
## #
## # A tibble: 4,228 × 5
## from to osm_id highway geometry
## <int> <int> <chr> <chr> <LINESTRING [°]>
## 1 1 2 9164335 motorway_link (-84.34923 33.74581, -84.3491 33.74624, -84…
## 2 3 4 9164434 trunk_link (-84.34893 33.76049, -84.34893 33.76072, -8…
## 3 5 6 9164438 trunk_link (-84.34917 33.76184, -84.34909 33.7619, -84…
## # ℹ 4,225 more rows
print(paste0('Which one is active?: ', sfnetworks::active(net)))
## [1] "Which one is active?: nodes"
You can also extract nodes and edges as separate sf objects using
st_as_sf() function. To extract sf objects representing
edges and nodes and store them separately, you can do (1)
net %>% activate("nodes") %>% st_as_sf() or,
equivalently, (2) net %>% st_as_sf("nodes"). The result
is the same. In the leaflet() function below which takes sf
objects but not sfnetworks objects, you need to extract
nodes and edges separately and pass the extraction. The yellow points
are nodes and gray lines are edges.
## Plot--
leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
addPolylines(data = net %>% st_as_sf('edges'),
col = "gray",
weight = 3,
opacity = 0.9,
popup = net %>% st_as_sf('edges') %>% pull(osm_id)) %>%
addCircles(data = net %>% st_as_sf('nodes'),
fillColor = 'yellow',
stroke = F,
radius = 20,
fillOpacity = 0.7)
Cleaning network
Simplifying network
A network data may contain edges that connect the same pair of nodes,
called multiple edges. There can also be loops that starts and
ends at the same node (e.g., think of a cul-de-sac). The former case can
be detected using edge_is_multiple() and the latter using
edge_is_loop().
When removing a set of multiple edges using functions shown below, they always keep the first edge and discard others. By arranging the order of edges for each set of multiple edges, you can specify which edge you want to preserve.
This way of simplification means that, when the multiple edges within
a set have different attributes, all attribute information except for
the preserved one would be lost. In such cases, you can merge
those edges. Then, the output would have the geometry of the first edge
in the set, but the attributes would be some summary of all the edges in
the set. to_spatial_simple() function does this work (but
not used in this document).
# Let's simplify our network
simple_net <- net %>%
activate("edges") %>%
filter(!edge_is_multiple()) %>%
filter(!edge_is_loop())
# Because the difference is not really discernible, we just print out the differences in the edge count.
message(str_c("Before simplification, there were ", net %>% st_as_sf("edges") %>% nrow(), " edges. \n",
"After simplification, there are ", simple_net %>% st_as_sf("edges") %>% nrow(), " edges."))
## Before simplification, there were 4228 edges.
## After simplification, there are 4197 edges.
Subdivide edges
Why subdivide?
When as_sfnetwork() function converts an sf linestrings,
the nodes are defined as the endpoints of each linestring. If
you zoom into Midtown in the map above, there are many edges that
crosses (i.e., intersections) but do not have nodes. The absence of
nodes at what appears to be intersections would be legitimate data if
the two crossing roads are, for example, bridges that have different
elevations. However, we anecdotally know that there aren’t that many
such overhead bridges in our study area.
We can use to_spatial_subdivision predicate to create
points at the intersections and cut the edges accordingly.
“Construct a subdivision of the network by subdividing edges at each interior point that is equal to any other interior or boundary point in the edges table. Interior points in this sense are those points that are included in their linestring geometry feature but are not endpoints of it, while boundary points are the endpoints of the linestrings. The network is reconstructed after subdivision such that edges are connected at the points of subdivision.” (source)
This to_spatial_subdivision function is a part of
functions called spatial
morphers. Morphers are rooted in tidygraph, which allows you to
temporarily morph the topology of the original network, do some work to
the “morphed state” of the network, and merge the results back to the
original network using unmorph() (See this
document from which I borrowed heavily). It is by
design that morphers are used not directly but by passing them to
morph() function.
Although morphing is meant to be temporary, we can also make the
changes permanent by passing them to
convert() function instead of morph() function
because we want to make permanent changes to our network (i.e., create
nodes at intersections), we will use convert().
# # Using spatial morpher
subdiv_net <- convert(simple_net, sfnetworks::to_spatial_subdivision)
## Warning: to_spatial_subdivision assumes attributes are constant over geometries
# -------------------------------------------
# Below is done for visualization purpose
subdiv_net <- subdiv_net %>%
activate("nodes") %>%
mutate(custom_id = seq(1, subdiv_net %>% st_as_sf("nodes") %>% nrow()),
is.new = case_when(is.na(.tidygraph_node_index) ~ "new nodes",
TRUE ~ "existing nodes"),
is.new = factor(is.new))
## Plot--
subdiv_pal <- colorFactor(palette = c("yellow", "red"), domain = subdiv_net %>% st_as_sf("nodes") %>% pull(is.new))
subdiv_map <- leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
addPolylines(data = subdiv_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9) %>%
addCircles(data = subdiv_net %>% st_as_sf('nodes'), fillColor = ~subdiv_pal(is.new), stroke = F, radius = 20, fillOpacity = 0.7) %>%
addLegend(position = "bottomright", pal = subdiv_pal, values = subdiv_net %>% st_as_sf("nodes") %>% pull(is.new))
subdiv_map
Delete pseudo nodes
Notice that there are nodes on edges that are unnecessary nodes called pseudo nodes. For directed networks, pseudo nodes are those that have only one incoming and one outgoing edge. For undirected networks, pseudo nodes are those that have two incident edges. If these edges connected by these pseudo nodes are identical in their attributes, these nodes are unnecessary in any ways. These nodes can be delete using to_spatial_smooth predicate.
# Using spatial morpher
smoothed_net <- convert(subdiv_net, sfnetworks::to_spatial_smooth)
# -------------------------------------------
# Below is done for visualization purpose
# Extract removed points for mapping purposes
removed <- setdiff(subdiv_net %>% st_as_sf('nodes') %>% pull(custom_id),
smoothed_net %>% st_as_sf('nodes') %>% pull(custom_id))
smooth_map <- leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
addPolylines(data = smoothed_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9, popup = smoothed_net %>% st_as_sf('edges') %>% rownames()) %>%
addCircles(data = smoothed_net %>% st_as_sf('nodes'), fillColor = 'yellow', stroke = F, radius = 20, fillOpacity = 0.7, group = "kept") %>%
addCircles(data = subdiv_net %>% st_as_sf("nodes") %>% filter(custom_id %in% removed),
fillColor = "red", stroke = F, radius = 15, fillOpacity = 0.4, group = "removed") %>%
addControl(html = htmltools::HTML("<b>Red points denote removed nodes</b>"), position = "bottomright") %>%
addLayersControl(overlayGroups = c("kept", "removed"))
smooth_map
Calculate centrality
One of the most common ways for analyzing the characteristics of networks are centrality measures. These algorithms assigns some numbers to nodes within a graph based on their position in the given network. There are many centrality measures, each with their own characteristics, as illustrated in the images below. We will try betweenness centrality and degree centrality.
Betweenness centrality measures “the percentage of shortest paths that must go through the specific node (Golbeck 2005, p.229)”.
Degree centrality is the simplest centrality measure to compute: it is a count of how many connections a node has (Golbeck 2005, p.226).
(Source: wikipedia)
(Source: Figure 1 in Cordeiro et al.
(2018))
Betweenness centrality
# Calculate centrality measures
network_char <- smoothed_net %>%
activate("edges") %>%
mutate(weight = edge_length()) %>%
mutate(edge_bc = centrality_edge_betweenness(weights = weight, directed = F)) %>%
activate("nodes") %>%
mutate(node_bc = centrality_betweenness(weights = weight, directed = F))
# Edge betweenness
bet_pal_edge <- colorNumeric(palette = "inferno", domain = network_char %>% activate("edges") %>% pull(edge_bc), n = 10)
# Node betweenness
bet_pal_node <- colorNumeric(palette = "inferno", domain = network_char %>% activate("nodes") %>% pull(node_bc), n = 10)
# Map
leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
addPolylines(data = network_char %>% st_as_sf("edges"),
color = ~bet_pal_edge(network_char %>% st_as_sf('edges') %>% pull(edge_bc)), weight = 3, opacity = 0.9) %>%
addCircles(data = network_char %>% st_as_sf("nodes"),
fillColor = ~bet_pal_node(network_char %>% st_as_sf('nodes') %>% pull(node_bc)), stroke = F, fillOpacity = 0.9,
radius = network_char %>% st_as_sf("nodes") %>% with(.$node_bc/(max(.$node_bc)/100))) # denominator is selected to make the max value roughly equal to 100
Degree centrality
Intersections are used in various measures of the built environment. For example, intersection density (number of intersections divided by the area) is used to calculate a walkability index (Frank et al., 2005) as well as in the well-known 5D framework (Ewing and Cervero, 2010). Remember that (unweighted) degree centrality simply means the number of edges each node has. Because we have cleaned the network by dropping pseudo nodes, the intersection would be those nodes with degree centrality greater than 1.
intersections <- smoothed_net %>%
st_transform(crs = 4326) %>%
activate("nodes") %>%
mutate(degree = centrality_degree(weights = NULL)) %>%
filter(degree > 1)
color_palette <- colorFactor(
palette = c("#878686", "#ffffff", "#ffed78", "#ff7878"),
domain = intersections %>% st_as_sf("nodes") %>% .$degree
)
leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
addPolylines(data = smoothed_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9) %>%
addCircles(data = intersections %>% st_as_sf('nodes'),
fillColor = ~color_palette(degree),
stroke = F,
radius = ~degree * 10,
fillOpacity = 0.9)
Shortest path
One of the most widely performed task with road networks is to find a
route between A and B, often the shortest path. IT can be easily done
using sfnetworks package’s
st_network_paths().
First, you need to create sf objects with POINT geometry that
represent origin and destination. These points do not need to be ON the
network - the sf_network_paths() will find the nearest node
and use them for the calculation. You can also supply multiple
destinations to a single call of st_network_paths(), but
the origin has to be the same. Then, you can pass them to
st_network_paths() along with the network. The output is a
data frame (a tibble to be more specific) with two columns, node_paths
and edge_paths. This data frame will have as many rows as the number of
origin-destination pairs. Opening up the content of the data frame, you
will notice that it is just a sequence of nodes and edges through which
the shortest path runs.
# Change crs for convenience
smoothed_net <- smoothed_net %>% st_transform(4326)
# Start point
start_p <- nominatimlite::geo_lite_sf('Georgia Tech CRC') %>% st_geometry() # same as `pull(gemetry)`
# End point
target_p1 <- nominatimlite::geo_lite_sf('Atlanta Botanical Garden') %>% st_geometry()
target_p2 <- nominatimlite::geo_lite_sf('Krog Street Market') %>% st_geometry()
# Get the shortest path
paths <- st_network_paths(smoothed_net, from = start_p, to = c(target_p1, target_p2), type = "shortest")
# Extract shortest path
paths_sf1 <- smoothed_net %>%
activate("nodes") %>%
slice(paths$node_paths[[1]])
paths_sf2 <- smoothed_net %>%
activate("nodes") %>%
slice(paths$node_paths[[2]])
# Visualize
leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
setView( -84.3854, 33.7618, zoom = 13) %>%
# Base network in grey
addPolylines(data = smoothed_net %>% st_as_sf("edges"), color = 'grey', weight = 2, opacity = 0.7) %>%
# Chosen edges
addPolylines(data = paths_sf1 %>% st_as_sf("edges"), color = "pink", weight = 10, opacity = 0.3) %>%
# Chosen nodes
addCircles(data = paths_sf1 %>% st_as_sf("nodes"), stroke = F, fillColor = "pink", fillOpacity = 0.8, radius = 50) %>%
# Chosen edges
addPolylines(data = paths_sf2 %>% st_as_sf("edges"), color = "lightgreen", weight = 10, opacity = 0.3) %>%
# Chosen nodes
addCircles(data = paths_sf2 %>% st_as_sf("nodes"), stroke = F, fillColor = "lightgreen", fillOpacity = 0.8, radius = 50)
Using GTFS and OSM together
Although GTFS and OSM are different data sources and may seem to reside in totally separate world, people in the real world frequently mix the two as they navigate in a city. For example, one might drive from home to the nearest MARTA rail transit station, park their car there, and ride transit to their work. To analyze such movement, we need to be able to somehow connect the two datasets.
For exercise, let’s examine how to compute the shortest path from one
location to MARTA rail stations. Since we are only interested in rail
stations, we first need to extract transit stops for rail. As we
discussed last week, GTFS is a relational table. The information on
whether a transit facility is for bus or rail is contained in
routes table, but it is not possible to link
stops table and routes table directly. We will
join routes -> trips -> stop_times -> stops. Then, we will
extract the stop_id of rail stations.
# Read GTFS
transit <- tidytransit::read_gtfs("https://www.itsmarta.com/google_transit_feed/google_transit.zip") %>%
tidytransit::gtfs_as_sf()
# Get stop_id for rails
rail_stops <- transit$routes %>%
filter(route_type %in% c(0,1,2)) %>%
inner_join(transit$trips, by = "route_id") %>%
inner_join(transit$stop_times, by = "trip_id") %>%
inner_join(transit$stops, by = "stop_id") %>%
group_by(stop_id) %>%
slice(1) %>%
pull(stop_id)
tm_shape(transit$stops %>% filter(stop_id %in% rail_stops)) + tm_dots(col = "red")
Now, we need the city-wide OSM network that we downloaded at the beginning.
# Using the full network file
atl_net <- osm_road$osm_lines %>%
select(osm_id, highway) %>%
sfnetworks::as_sfnetwork(directed = FALSE) %>%
activate("edges") %>%
filter(!edge_is_multiple()) %>%
filter(!edge_is_loop()) %>%
convert(., sfnetworks::to_spatial_subdivision) %>%
convert(., sfnetworks::to_spatial_smooth)
## Warning: to_spatial_subdivision assumes attributes are constant over geometries
Next, the remaining steps are:
- Define the starting point.
- Define the ending points (which are MARTA rail stations).
- Fine the shortest paths from the starting point to all ending points (i.e., stations).
- Calculate the distance of each route between the starting point to each MARTA rail stations.
- Replace those with 0 distances with some large number (e.g., maximum distance). These are those for which the algorithm could not find connecting routes.
- Find the one with the minimum distance.
- Map it!
# Change crs for convenience
atl_net <- atl_net %>% st_transform(4326) %>% activate("edges") %>% mutate(length = edge_length())
# Step 1. Start point
start_p <- nominatimlite::geo_lite_sf("IKEA Atlanta") %>% st_geometry() # I chose IKEA
# Step 2. End point
rail_stops_point <- transit$stops %>% filter(stop_id %in% rail_stops) %>% st_geometry()
# Step 3. Get the shortest path
paths <- st_network_paths(atl_net, from = start_p, to = rail_stops_point, type = "shortest")
# Step 4. Find the distances
dist_all <- map_dbl(1:nrow(paths), function(x){
atl_net %>% activate("nodes") %>% slice(paths$node_paths[[x]]) %>% st_as_sf("edges") %>% pull(length) %>% sum()
}) %>% unlist()
# Step 5. Replace zeros with a large values
if (any(dist_all == 0)){
dist_all[dist_all == 0] <- max(dist_all)
}
# Step 6. Find the smallest one
closest_index <- which.min(dist_all)
# Extract shortest path
paths_closest <- atl_net %>%
activate("nodes") %>%
slice(paths$node_paths[[closest_index]])
# Selected station
closest_station <- transit$stops %>% filter(stop_id %in% rail_stops) %>% slice(closest_index)
# Step 7. Map it!
leaflet() %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
# Base network in grey
addPolylines(data = atl_net %>% st_as_sf("edges"), color = 'grey', weight = 2, opacity = 0.7) %>%
# Chosen edges
addPolylines(data = paths_closest %>% st_as_sf("edges"), color = "red", weight = 10, opacity = 0.5) %>%
# Chosen nodes
addCircles(data = paths_closest %>% st_as_sf("nodes"), stroke = F, fillColor = "red", fillOpacity = 0.8, radius = 50) %>%
# Chosen station
addCircles(data = closest_station, fillColor = "blue", radius = 300)